Adds GCSFS Microbenchmarks #722
base: main
Conversation
* Fixing Block size and consistency options in Extended GCSFS Open (#34)
* Separate versioned and non-versioned tests to use different bucket
* Update cleanup logic in tests to empty the bucket instead of deleting the bucket
* Add consistency and blocksize for ext gcsfs
* Merge conflicts resolution
* bucket type paramet for fs test for block size and consistency
* removed unused mocks variables from some tests
* fixing lint errors
* fixed small issue with core tests so it can run with exp flag true

---------

Co-authored-by: Mahalaxmibejugam <60227368+Mahalaxmibejugam@users.noreply.github.com>

* pytest microbenchmarks for seq and random reads both single multi threaded
* multiprocess benchmarks
* script to run tests
* undo settings for bucket names and logs
* benchmark script updates
* file size and bucket type decorators
* file size configuration
* removed zonal config
* Added README
* Readme update
* Moving settings and fixture to tests root
* Readme update
* Readme update
* Ignore benchmark pytests in CI
* benchmark hook fix
* adding skip tests flag
* benchmark plugin conditional enablement
* Fixing PR Comments, simplifying the configuration by doing auto gen
* Fixing PR Comments
* default settings

---------

Co-authored-by: Mahalaxmibejugam <60227368+Mahalaxmibejugam@users.noreply.github.com>
/gcbrun
martindurant left a comment:
I have very little to say on this - I'm excited to see the results and what improvements your efforts can make!
> ## Introduction
>
> This document describes the microbenchmark suite for `gcsfs`. These benchmarks are designed to measure the performance of various I/O operations under different conditions. They are built using `pytest` and the `pytest-benchmark` plugin to provide detailed performance metrics for single-threaded, multi-threaded, and multi-process scenarios.
Style: comments are harder to read with these long lines; I recommend sticking to an 80-character limit.
I would also test against concurrent.interpreters, since it is now available in Python 3.14, and also against free-threaded builds if all the upstream dependencies support them.
concurrent.interpreters is something I will take up in future PRs.
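For readers unfamiliar with `pytest-benchmark`, here is a minimal sketch of what one of these read microbenchmarks might look like. The bucket and object names are placeholders, not the PR's actual fixtures or configuration.

```python
import gcsfs


def test_sequential_read(benchmark):
    # `benchmark` is the fixture provided by pytest-benchmark; it calls the
    # wrapped function repeatedly and records timing statistics.
    fs = gcsfs.GCSFileSystem()  # uses default application credentials
    path = "my-benchmark-bucket/testfile"  # placeholder object

    def read_whole_file():
        with fs.open(path, "rb") as f:
            return f.read()

    data = benchmark(read_whole_file)
    assert data  # sanity check that bytes were actually read
```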
> You can install them using pip:
>
> ```bash
> pip install -r gcsfs/tests/perf/microbenchmarks/requirements.txt
> ```
Many projects are using tools like uv to specify the dependencies and run commands in a more declarative way, rather than in descriptive README text like this. Might be worth thinking about.
I have added the dependencies to the environment file as well, so a conda run works out of the box. For non-conda scenarios we will have to live with requirements.txt for now; this is something we can revisit in the future.
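As an illustration of the reviewer's suggestion, a script with PEP 723 inline metadata can declare its dependencies in the file itself, which tools like `uv run` understand. The dependency list below is an assumption and may not match the PR's actual requirements file.

```python
# /// script
# dependencies = [
#     "gcsfs",
#     "pytest",
#     "pytest-benchmark",
# ]
# ///
# Running this file with `uv run <script>.py` resolves the dependencies
# declared above into an ephemeral environment before executing it.
import pytest

if __name__ == "__main__":
    # --benchmark-only is a pytest-benchmark flag: run only benchmark tests.
    pytest.main(["--benchmark-only"])
```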
> ```python
> benchmark_plugin_installed = False
>
>
> @pytest.fixture
> ```
Session scope?
We want this fixture to create new files on every test run, hence keeping it function-scoped. If we create files once per session, the results will include caching done by GCS, which we don't want here.
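To make the trade-off concrete, a function-scoped fixture along these lines would upload a fresh object for every test, so GCS-side caching of a reused object cannot skew the numbers. The bucket name, object size, and fixture name are illustrative, not the PR's actual code.

```python
import uuid

import pytest
import gcsfs


@pytest.fixture  # default function scope: a new object per test invocation
def benchmark_file():
    fs = gcsfs.GCSFileSystem()
    path = f"my-benchmark-bucket/bench-{uuid.uuid4().hex}"
    with fs.open(path, "wb") as f:
        f.write(b"x" * (1024 * 1024))  # 1 MiB of dummy data
    yield path
    fs.rm(path)  # remove the object so runs do not accumulate files
```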
> ```python
> def wrapper():
>     base_cases = base_cases_func()
>     new_cases = []
>     for case in base_cases:
> ```
Can't this kind of thing be done with parametrize?
The configuration has now moved to YAML, so this no longer applies.
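A rough sketch of that YAML-driven approach, assuming a config file with a `read_cases` list; the file name and schema here are assumptions, not the PR's actual layout.

```python
import pytest
import yaml

with open("benchmark_config.yaml") as f:
    READ_CASES = yaml.safe_load(f)["read_cases"]


@pytest.mark.parametrize("case", READ_CASES, ids=lambda c: c["name"])
def test_read(case, benchmark):
    # Each YAML entry becomes one parametrized benchmark case; fields such as
    # case["file_size"] or case["threads"] would drive the actual read logic.
    ...
```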
> ```python
>     return wrapper
>
>
> def _get_bucket_name_for_type(bucket_type: str) -> str:
> ```
This is just a dictionary :)
Moved to a map in conftest so it can also be reused by other tests in the future.
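The reviewer's point, sketched as the conftest-level mapping the reply describes; the bucket names are placeholders.

```python
# A plain module-level dict replaces the helper function; lookups raise a
# KeyError for unknown bucket types, which is usually the desired behaviour.
BUCKET_FOR_TYPE = {
    "regional": "my-regional-benchmark-bucket",
    "zonal": "my-zonal-benchmark-bucket",
}

bucket = BUCKET_FOR_TYPE["regional"]
```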
> ```python
>     return wrapper
>
>
> def with_threads(base_cases_func: Callable) -> Callable:
> ```
This also looks like code duplication; I suggest these generator functions could be factored out.
The generators are now removed; the configuration has moved to a YAML file and the YAML values are converted to parameters.
> ```python
> @with_bucket_types(["regional", "zonal"])
> @with_file_sizes
> @with_threads
> @with_processes
> ```
Perhaps it doesn't matter, since this will only get written once, but decorators are less easy to understand than simple sequential code. This ends up working like parametrize, which is fine and a well-established pattern. But I wonder if it could have been written more simply within the function.
Same as above: removed the decorators and moved to a YAML-based config.
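For comparison with the stacked custom decorators quoted above, stacked `pytest.mark.parametrize` decorators produce the same cross-product of cases with a well-known mechanism; the values below are illustrative, not the PR's configuration.

```python
import pytest


@pytest.mark.parametrize("bucket_type", ["regional", "zonal"])
@pytest.mark.parametrize("file_size_mb", [16, 256])
@pytest.mark.parametrize("threads", [1, 8])
def test_read_matrix(bucket_type, file_size_mb, threads, benchmark):
    # 2 x 2 x 2 = 8 combinations are generated automatically by pytest.
    ...
```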
> ```text
> # Dependencies for running the performance benchmarks
> gcsfs
> ```
Should this refer to the local directory explicitly? We are assuming the user has already installed gcsfs from source.
Done, thanks for the feedback.
* Fixing Block size and consistency options in Extended GCSFS Open (#34)
* Separate versioned and non-versioned tests to use different bucket
* Update cleanup logic in tests to empty the bucket instead of deleting the bucket
* Add consistency and blocksize for ext gcsfs
* Merge conflicts resolution
* bucket type paramet for fs test for block size and consistency
* removed unused mocks variables from some tests
* fixing lint errors
* fixed small issue with core tests so it can run with exp flag true

---------

Co-authored-by: Mahalaxmibejugam <60227368+Mahalaxmibejugam@users.noreply.github.com>

* pytest microbenchmarks for seq and random reads both single multi threaded
* multiprocess benchmarks
* script to run tests
* undo settings for bucket names and logs
* benchmark script updates
* file size and bucket type decorators
* file size configuration
* removed zonal config
* Added README
* Readme update
* Moving settings and fixture to tests root
* Readme update
* Readme update
* Ignore benchmark pytests in CI
* benchmark hook fix
* adding skip tests flag
* benchmark plugin conditional enablement
* Fixing PR Comments, simplifying the configuration by doing auto gen
* Fixing PR Comments
* default settings
* added resource monitoring for benchmarks
* minor refactoring
* moved config to yaml
* lint fixes
* config yaml update for full read suite
* undo zonal file logging changes
* simplify single threaded read
* bringing back requirements
* update readme
* csv generation fix when some tests fail
* psutil install in cloudbuild
* psutil install in cloudbuild
* Removing zonal conditional code and updating config

---------

Co-authored-by: Mahalaxmibejugam <60227368+Mahalaxmibejugam@users.noreply.github.com>
/gcbrun

/gcbrun
* Fixing Block size and consistency options in Extended GCSFS Open (#34)
* Separate versioned and non-versioned tests to use different bucket
* Update cleanup logic in tests to empty the bucket instead of deleting the bucket
* Add consistency and blocksize for ext gcsfs
* Merge conflicts resolution
* bucket type paramet for fs test for block size and consistency
* removed unused mocks variables from some tests
* fixing lint errors
* fixed small issue with core tests so it can run with exp flag true

---------

Co-authored-by: Mahalaxmibejugam <60227368+Mahalaxmibejugam@users.noreply.github.com>

* pytest microbenchmarks for seq and random reads both single multi threaded
* multiprocess benchmarks
* script to run tests
* undo settings for bucket names and logs
* benchmark script updates
* file size and bucket type decorators
* file size configuration
* removed zonal config
* Added README
* Readme update
* Moving settings and fixture to tests root
* Readme update
* Readme update
* Ignore benchmark pytests in CI
* benchmark hook fix
* adding skip tests flag
* benchmark plugin conditional enablement
* Fixing PR Comments, simplifying the configuration by doing auto gen
* Fixing PR Comments
* default settings
* added resource monitoring for benchmarks
* minor refactoring
* moved config to yaml
* lint fixes
* config yaml update for full read suite
* undo zonal file logging changes
* simplify single threaded read
* bringing back requirements
* update readme
* csv generation fix when some tests fail
* psutil install in cloudbuild
* psutil install in cloudbuild
* Removing zonal conditional code and updating config
* parallel file creation in setup
* merge conflicts
* merge conflicts - lint fixes
* lint issues

---------

Co-authored-by: Mahalaxmibejugam <60227368+Mahalaxmibejugam@users.noreply.github.com>
/gcbrun

/gcbrun

/gcbrun
This PR introduces a comprehensive microbenchmark suite for `gcsfs` to measure the performance of I/O operations under various conditions. The framework is built using `pytest` and the `pytest-benchmark` plugin, providing detailed performance metrics for single-threaded, multi-threaded, and multi-process scenarios.

Key Changes:

Microbenchmark Framework: A new microbenchmark suite has been added under `gcsfs/tests/perf/microbenchmarks/`. It includes:

* A `README.md` with instructions on how to set up and run the benchmarks.
* A `run.py` that simplifies the process of running benchmarks and generating reports.
* A `conftest.py` for setting up and tearing down the test environment, including creating temporary files of specified sizes.

Read Benchmarks: The initial implementation focuses on read performance with the following features:

* Sequential (`seq`) and random (`rand`) read patterns.

Configuration and Settings:

* Benchmark parameters are defined in `gcsfs/tests/perf/microbenchmarks/read/parameters.py` and can be customized using environment variables as defined in `gcsfs/tests/settings.py`.
* Benchmarks are enabled by setting the `GCSFS_BENCHMARK_SKIP_TESTS` environment variable to `false`.

Reporting:

* The `run.py` script generates a timestamped directory containing detailed results in both JSON and CSV formats.

For Reviewers:

* The core benchmark logic is in `gcsfs/tests/perf/microbenchmarks/read/test_read.py`.
* Configuration lives in `gcsfs/tests/perf/microbenchmarks/read/configs.py` and `gcsfs/tests/settings.py`.
* The `README.md` file in the microbenchmarks directory should be reviewed for clarity and completeness.

This new benchmark suite will be a valuable tool for more data-driven performance analysis and for ensuring that `gcsfs` remains highly performant.
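As a rough illustration of the enablement flag described above, a gate along these lines could sit in the benchmark `conftest.py`; the exact hook used in the PR may differ.

```python
import os

import pytest

# Benchmarks are skipped unless GCSFS_BENCHMARK_SKIP_TESTS is explicitly "false".
SKIP_BENCHMARKS = os.environ.get("GCSFS_BENCHMARK_SKIP_TESTS", "true").lower() != "false"

benchmark_skip = pytest.mark.skipif(
    SKIP_BENCHMARKS,
    reason="set GCSFS_BENCHMARK_SKIP_TESTS=false to run the microbenchmarks",
)
```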